class: center, middle, inverse, title-slide .title[ # Introduction to Map Making and Census Data ] .subtitle[ ## Yale’s BIS 679 (Advanced Statistical Programming) Guest Lecture ] .author[ ### Josemari Feliciano ] .institute[ ### American University ] .date[ ### December 1, 2025 ] --- ## Discuss US-Based Geospatial and Demographic Data: Goals Today - Introduce you to the various geographic boundaries (e.g., counties, tracts, block groups) we work with here in the US. - Introduce you to Census geocoding tools using (a) the Bureau's web interface and (b) the tidygeocoder package. - Provide an overview of the various datasets offered by the US Census Bureau. - Provide a detailed introduction to the American Community Survey (ACS) data. - Learn how to use R packages (e.g., censusapi, tidycensus) to seamlessly download and work with ACS data. - Learn the basics of static map making using ggplot2. --- ## GitHub File Repository Relevant files are located in this GitHub repo: https://github.com/jmtfeliciano/BISGeospatialGuestLecture Let us quickly spend a minute viewing the materials within the repo. Notable files include: 1. All scripts found in this slide deck are also shared in the `GuestLectureScriptsandPractice.qmd`, where you can run the code yourself in RStudio. 2. There is an optional file (`PostClassPractice.qmd`) that you can work on after the class should you want to practice. It contains plenty of practice problems to refine your skills, reviews many of the concepts we will go over today, and teaches you how to create 'dark mode' images. 3. There is a `dataset` folder that contains multiple files for the `PostClassPractice.qmd`. --- ## Signing up for a Census API key Let us pause for a minute or two before we continue with the workshop. Go to the link below and sign up for a free API key from the Census Bureau: https://api.census.gov/data/key_signup.html We will need the API key for the censusapi and tidycensus packages.
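Once you have a key, it helps to see where it goes. A minimal sketch, using a placeholder string (censusapi reads the `CENSUS_KEY` environment variable; tidycensus has its own `census_api_key()` helper, shown later in these slides):

``` r
# Store the key for the current session; censusapi looks for CENSUS_KEY.
# "PASTE_YOUR_KEY_HERE" is a placeholder, not a real key.
Sys.setenv(CENSUS_KEY = "PASTE_YOUR_KEY_HERE")

# Verify the key is visible to the session
Sys.getenv("CENSUS_KEY")
```

For anything beyond a one-off session, consider putting the key in your `.Renviron` file rather than hard-coding it in scripts you might share.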
--- ## Geographic Identifiers (GEOIDs): The Basics. Geographic identifiers (or GEOIDs) are numeric codes that uniquely identify all administrative/legal and statistical geographic areas. - Without a common identifier shared by geographic and demographic datasets, researchers and other stakeholders would have a difficult time pairing the appropriate demographic data with the appropriate geographic data, which considerably increases data processing time and the likelihood of data inaccuracy. Here in the US, we primarily use what are called Federal Information Processing Series (FIPS) codes. - Many US-based datasets label their geographic and demographic records with either GEOID or FIPS to indicate the relevant code; the two terms are used interchangeably. - If you are working with spatial data, it is best to have the FIPS code on hand to easily merge the datasets. --- ## Geographic hierarchies <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/geography.png" alt="Figure 1. Geographic hierarchies in the United States. Typically, the key notable geographic levels scientists and policy makers concern themselves with are: (1) State, (2) County, (3) Census Tract, (4) Census Block, and (5) Zip Code Tabulation Areas (ZCTAs)." width="60%" /> <p class="caption">Figure 1. Geographic hierarchies in the United States. Typically, the key notable geographic levels scientists and policy makers concern themselves with are: (1) State, (2) County, (3) Census Tract, (4) Census Block, and (5) Zip Code Tabulation Areas (ZCTAs).</p> </div> --- ## Geographic hierarchies <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/geographyv2.png" alt="Figure 2. Geographic hierarchies in the United States. Visual representation of how counties, census tracts, block groups, and blocks are nested within one another."
width="60%" /> <p class="caption">Figure 2. Geographic hierarchies in the United States. Visual representation of how counties, census tracts, block groups, and blocks are nested within one another.</p> </div> --- ## Federal Information Processing Standards (FIPS) <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/fips.png" alt="Figure 3. FIPS standards across geographic hierarchies. Again, GEOID/FIPS codes are typically what we use to identify both the geographic level and specific location we are working with. FIPS and GEOID are often used synonymously with one another." width="60%" /> <p class="caption">Figure 3. FIPS standards across geographic hierarchies. Again, GEOID/FIPS codes are typically what we use to identify both the geographic level and specific location we are working with. FIPS and GEOID are often used synonymously with one another.</p> </div> --- ## State FIPS Codes State FIPS codes are listed on many websites; this specific list comes from [the Census Bureau directly](https://www2.census.gov/geo/docs/reference/state.txt).
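Because the hierarchy is nested, a GEOID is just fixed-width FIPS pieces pasted together: 2 digits for the state, 3 for the county, 6 for the tract. A minimal sketch in base R (the state/county pair below is Connecticut / New Haven County; the 6-digit tract code is made up for illustration):

``` r
state_fips  <- 9       # Connecticut
county_fips <- 9       # New Haven County
tract_code  <- 140100  # hypothetical 6-digit tract code

# Zero-pad each piece to its fixed width, then concatenate
county_geoid <- sprintf("%02d%03d", state_fips, county_fips)
tract_geoid  <- sprintf("%s%06d", county_geoid, tract_code)
county_geoid  # "09009"
tract_geoid   # "09009140100"

# And take a tract GEOID back apart by position
substr(tract_geoid, 1, 2)  # state FIPS  -> "09"
substr(tract_geoid, 3, 5)  # county FIPS -> "009"
```

This is also why FIPS columns should stay character, never numeric: a numeric read silently drops the leading zero in "09009".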
--- ### A Quick Detour: Non-Census Data at County and Tract Level Federal agencies and researchers are increasingly using the CDC/ATSDR Social Vulnerability Index (SVI). __From CDC:__ "Natural disasters and infectious disease outbreaks can pose a threat to a community’s health. Socially vulnerable populations are especially at risk during public health emergencies because of factors like socioeconomic status, household composition, minority status, or housing type and transportation." __SVI Availability:__ Data are available at the county and tract levels. __Index Range:__ The index (labelled RPL_THEMES in the dataset) is a score between 0 (least vulnerable) and 1 (most vulnerable). __Note:__ Some CDC datasets are currently down (permanently or temporarily) as federal agencies are assessing their compliance with President Trump's Executive Orders. The site for CDC's SVI dataset is currently down. --- ### Social Vulnerability Index Scoring Breakdown <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/svi_breakdown.png" alt="Figure 4. Factors that impact the index for the 2022 SVI. The latest SVI data is for 2022. The dataset calculates the SVI using Census ACS data (more on this later) for 2018-2022." width="60%" /> <p class="caption">Figure 4. Factors that impact the index for the 2022 SVI. The latest SVI data is for 2022. The dataset calculates the SVI using Census ACS data (more on this later) for 2018-2022.</p> </div> --- ### Partial SVI Data at County-Level:
--- ### Partial SVI Data at Census Tract-Level:
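The 0-1 scale of RPL_THEMES can be demystified with a tiny sketch: the SVI is built from percentile ranks, which CDC documents as (rank - 1) / (n - 1). The raw values below are invented purely for illustration; the real SVI percentile-ranks many ACS measures across four themes, not a single column:

``` r
# Toy raw measure for five hypothetical tracts (invented numbers)
raw <- c(2.1, 5.4, 3.3, 9.0, 1.2)

# Percentile rank on a 0-1 scale: (rank - 1) / (n - 1)
rpl <- (rank(raw) - 1) / (length(raw) - 1)
rpl  # 0 = least vulnerable, 1 = most vulnerable
```

Whatever the raw inputs, the most vulnerable unit lands at exactly 1 and the least vulnerable at exactly 0, which is why RPL_THEMES values are directly comparable within a single SVI release.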
--- ## Coordinates: Latitude vs Longitude The numbers are in decimal degrees format and range from -90 to 90 for latitude and -180 to 180 for longitude. <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/longlat.svg" alt="Figure 5. Image showing longitude (x-axis) and latitude (y-axis)." width="98%" /> <p class="caption">Figure 5. Image showing longitude (x-axis) and latitude (y-axis).</p> </div> --- ## Free geocoding tools You likely know which county you are in. However, you likely do not know the specific census tract or block group you are in. Luckily, the Census Bureau has free geocoding tools (e.g., a web interface, an API) that can help us! The web interface is at: https://geocoding.geo.census.gov/geocoder/geographies/address?form Let us do a quick live demo of this tool. --- ### Free geocoding tools in R: forward geocoding You can use the tidygeocoder package in R. The code below is an example of forward geocoding (addresses ⮕ coordinates). By default, geocode() uses Nominatim (an OpenStreetMap-based geocoding service) to perform the task.
.tiny[ ``` r # run if not installed before: install.packages("tidygeocoder") library(tidygeocoder) addresses_df <- data.frame(address = c("60 College St, New Haven, CT")) geocode(addresses_df, address = address) ``` ``` ## Passing 1 address to the Nominatim single address geocoder ``` ``` ## Query completed in: 1 seconds ``` ``` ## # A tibble: 1 × 3 ## address lat long ## <chr> <dbl> <dbl> ## 1 60 College St, New Haven, CT 41.3 -72.9 ``` ] --- ### Free geocoding tools in R: forward geocoding It can process multiple addresses: ``` r # run if not installed before: install.packages("tidygeocoder") library(tidygeocoder) addresses_df_v2 <- data.frame(address = c("1600 Pennsylvania Avenue, Washington, DC", "1313 Disneyland Dr, Anaheim, CA")) geocode(addresses_df_v2, address = address) ``` ``` ## Passing 2 addresses to the Nominatim single address geocoder ``` ``` ## Query completed in: 2 seconds ``` ``` ## # A tibble: 2 × 3 ## address lat long ## <chr> <dbl> <dbl> ## 1 1600 Pennsylvania Avenue, Washington, DC 38.9 -77.0 ## 2 1313 Disneyland Dr, Anaheim, CA 33.8 -118. ``` --- ### Free geocoding tools in R: forward geocoding If you are interested in determining the geographies (e.g., county and tract information), you need to use the "census" method, which leverages the Census Bureau's geocoder API. For the full output, see the next slide.
.tiny[ ``` r # run if not installed before: install.packages("tidygeocoder") library(tidygeocoder) addresses_df_v3 <- data.frame(address = c("26 Plympton St, Cambridge, MA")) results_df <- geocode(addresses_df_v3, address = address, method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies')) ``` ] --- ### Some extractable information from results_df .tiny[ ``` r # Script you need to run to see county information results_df$geographies.Counties ``` ``` ## [[1]] ## GEOID CENTLAT AREAWATER STATE BASENAME OID LSADC FUNCSTAT ## 1 25017 +42.4853699 75219797 25 Middlesex 27590260056194 06 N ## INTPTLAT NAME OBJECTID CENTLON COUNTYCC COUNTYNS ## 1 +42.4817215 Middlesex County 2374 -071.3915833 H4 00606935 ## AREALAND INTPTLON MTFCC COUNTY ## 1 2118363247 -071.3949160 G4020 017 ``` ``` r # Script you need to run to see census tract information results_df$`geographies.Census Tracts` ``` ``` ## [[1]] ## GEOID CENTLAT AREAWATER STATE BASENAME OID LSADC ## 1 25017353700 +42.3745268 0 25 3537 20790260102279 CT ## FUNCSTAT INTPTLAT NAME OBJECTID TRACT CENTLON AREALAND ## 1 S +42.3745268 Census Tract 3537 17567 353700 -071.1122878 532087 ## INTPTLON MTFCC COUNTY ## 1 -071.1122878 G5020 017 ``` ] --- ### What other geographic details are available? Pay close attention to column names with a `geographies.` prefix. .tiny[ ``` r colnames(results_df)[7:17] ``` ``` ## [1] "geographies.States" ## [2] "geographies.Combined Statistical Areas" ## [3] "geographies.County Subdivisions" ## [4] "geographies.Urban Areas" ## [5] "geographies.Incorporated Places" ## [6] "geographies.Counties" ## [7] "geographies.2024 State Legislative Districts - Upper" ## [8] "geographies.2024 State Legislative Districts - Lower" ## [9] "geographies.2020 Census Blocks" ## [10] "geographies.Census Tracts" ## [11] "geographies.119th Congressional Districts" ``` ] --- ### Free geocoding tools in R: reverse geocoding You can use the tidygeocoder package in R.
The code below is an example of reverse geocoding (coordinates ⮕ addresses). .tiny[ ``` r reverse_geo(lat = "41.30374", long = "-72.93216") ``` ``` ## Passing 1 coordinate to the Nominatim single coordinate geocoder ``` ``` ## Query completed in: 1 seconds ``` ``` ## # A tibble: 1 × 3 ## lat long address ## <dbl> <dbl> <chr> ## 1 41.3 -72.9 Laboratory of Epidemiology and Public Health, 60, College Street,… ``` ] --- ## List of Census Surveys and Datasets .pull-left[ - The US Census Bureau conducts 130+ surveys each year. - A detailed list can be accessed by clicking [this](https://www.census.gov/programs-surveys/surveys-programs.html). - Let us quickly explore this list using a web browser. ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/census_survey.png" alt="Figure 6. Screenshot of page showing all the surveys performed by the US Census Bureau." width="100%" /> <p class="caption">Figure 6. Screenshot of page showing all the surveys performed by the US Census Bureau.</p> </div> ] --- ### Quick overview of Small Area Health Insurance Estimates (SAHIE) .pull-left[ You may access the large yearly SAHIE datasets by clicking [this](https://www2.census.gov/programs-surveys/sahie/datasets/time-series/estimates-acs/). __Recommended:__ Access the data using the SAHIE interactive tool instead, which greatly minimizes data cleaning/subsetting tasks. The tool can be accessed by clicking [this](https://www.census.gov/data-tools/demo/sahie/#/). ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/sahie.png" alt="Figure 7. Screenshot of the SAHIE dashboard." width="100%" /> <p class="caption">Figure 7. Screenshot of the SAHIE dashboard.</p> </div> ] --- ## SAHIE interactive tool Let us do a quick live demo of downloading county-level SAHIE data for Alabama. Our goal is to download and clean the data to be ready for reading into R.
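Once you have exported and cleaned a CSV from the tool, reading it into R is one call. A sketch with a made-up two-county extract (the file name and columns below are stand-ins, not the real SAHIE layout); the key detail is forcing the FIPS column to character so leading zeros survive:

``` r
# Write a tiny stand-in for a cleaned SAHIE export (invented values)
writeLines(c("county_fips,county_name,pct_uninsured",
             "01001,Autauga County,9.8",
             "01003,Baldwin County,11.2"),
           "sahie_al_demo.csv")

# colClasses keeps "01001" as text; a plain numeric read would yield 1001
sahie_al <- read.csv("sahie_al_demo.csv",
                     colClasses = c(county_fips = "character"))
sahie_al$county_fips  # "01001" "01003"
```

readr's `read_csv()` (used later in these slides) does the same job with `col_types = cols(county_fips = col_character())`.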
--- ## The American Community Survey (ACS) Data The American Community Survey (ACS) is an ongoing yearly survey. It is arguably the most widely used dataset from the Census Bureau. __Census Bureau:__ "[ACS] is the premier source for detailed population and housing information about our nation." __Two yearly versions:__ ACS 1-year and ACS 5-year. Note: For older data (2007-2013), 3-year estimates also exist. - The ACS 1-year file estimates data for areas with populations of 65,000+. - The ACS 5-year file estimates data for all areas regardless of population size. - Many datasets provided by other federal agencies are subsets of, or are created in part using, ACS data. __Language:__ If someone says they're using the 5-year 2020 ACS data, they're referring to the 2016-2020 5-year ACS data. --- ## The American Community Survey (ACS) Data There is a lot of ACS-related documentation online. __What to look for and remember:__ Table shells. Table shells (particularly for detailed tables) are a comprehensive list of variable documentation for ACS data. - Table shells are provided [yearly](https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.html). If you are doing a longitudinal study using ACS data (e.g., a 10-year study), it might be a good idea to check the relevant yearly table shells to see if the variables of interest are available for all years. - Alternative ACS documentation: the API documentation also lists all the available ACS variables [here](https://api.census.gov/data/2022/acs/acs5/variables.html). Let us download the ACS 2024 Table Shells and go over them together briefly. Let us search for disability-related variables. --- ## The American Community Survey (ACS) Table Shells .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/TableShell.png" alt="Figure 8. Excel Table Shell." width="98%" /> <p class="caption">Figure 8.
Excel Table Shell.</p> </div> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/API.png" alt="Figure 9. API Table Shell." width="98%" /> <p class="caption">Figure 9. API Table Shell.</p> </div> ] UniqueID from the Excel Table Shell (or Name in the API Table Shell) represents the variable(s) we want to get from the Census ACS data. Technically, Name in the API Table Shell is more accurate since it carries the suffix "E", which stands for "estimate". Do note that if you're using the Excel Table Shell, you'll eventually need to add the suffix "E" when requesting data from the Census API. --- ### Before we dive into censusapi I will show you one quick side project I did--probably spent < 2 hours--where I created a simple webpage that uses the ACS data I pulled to help friends who are doing community outreach: https://jmtfeliciano.github.io/FilipinoPopulationDMVArea.html --- ### censusapi package: pulling data into R. Suppose we are interested in county-level population data. According to the ACS table shell, the variable we need to use is: B01001_001E. We can use the censusapi package to get census data directly into R.
.tiny[ ``` r # run this if not installed: install.packages("censusapi") library(censusapi) Sys.setenv(CENSUS_KEY="") # put your API key here population_data <- getCensus( name = "acs/acs5", # requests ACS5 data vintage = 2022, # requests 2022 data vars = c("B01001_001E"), # requested variable region = "county:*") # requested geography head(population_data) ``` ``` ## state county B01001_001E ## 1 01 001 58761 ## 2 01 003 233420 ## 3 01 005 24877 ## 4 01 007 22251 ## 5 01 009 59077 ## 6 01 011 10328 ``` ] --- ## Examples of Other Important censusapi calls .tiny[ An example of asking for multiple variables: ``` r data1 <- getCensus(name = "acs/acs5", vintage = 2019, vars = c("B28002_001E", "B28002_002E"), region = "county:*") ``` An example of asking for Connecticut-only county-level data (Note: CT's FIPS code is 09). ``` r data2 <- getCensus(name = "acs/acs5", vintage = 2019, vars = c("B28002_001E", "B28002_002E"), region = "county:*", regionin = "state:09") ``` An example of asking for Missouri-only tract-level data (Note: MO's FIPS code is 29). ``` r data3 <- getCensus(name = "acs/acs5", vintage = 2019, vars = c("B28002_001E", "B28002_002E"), region = "tract:*", regionin = "state:29") ``` ] --- ## Quick Detour: Merging data Merging data from one data frame (table) into another data frame (table) using `left_join()`. Suppose superheroes and publishers are two data frames: <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/left_join.png" alt="Figure 10. Visualizing left_join(). Excerpt of resources I created last semester for my DATA 412/612 course. I like to think of left_join as supplementing a table with data from another table. Note: This works because both tables have a variable named 'publisher' (this required common variable is called a join key)." width="50%" /> <p class="caption">Figure 10. Visualizing left_join(). Excerpt of resources I created last semester for my DATA 412/612 course.
I like to think of left_join as supplementing a table with data from another table. Note: This works because both tables have a variable named 'publisher' (this required common variable is called a join key).</p> </div> --- ## The typical recipe: Making Maps the Simple Way 1. We need a __shapefile__, which is a digital format for storing geographic location and associated attribute information (e.g., points, lines, or polygons). Think of the boundaries, shapes, and geometric information that delineate locations. We will use the __tigris package__ to get the shapefiles directly from the US Census Bureau, so we don't have to download and manually load actual shapefiles (.shp) into R. 2. We need the data we are interested in mapping. 3. Merge the data (e.g., SVI data) into the shapefile (which is also a data frame). 4. Leverage the ggplot2 package to render the map. <br> __For this example:__ We will map the SVI data from the CDC for Missouri. You may download the relevant file from my GitHub repo (link in two slides). Partial data is printed on the next slide so we can study the data together. __Before we proceed:__ Run `install.packages("tigris")` via your R console if you don't have it installed. --- ## Partial MO's 2019 Tract-level SVI Data
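The left_join() idea in Figure 10 can be reproduced with toy data frames (the names and values below are invented for illustration):

``` r
library(dplyr)

# Toy tables in the spirit of Figure 10
superheroes <- data.frame(
  name      = c("Batman", "Spider-Man", "Hellboy"),
  publisher = c("DC", "Marvel", "Dark Horse")
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel"),
  yr_founded = c(1934, 1939)
)

# Every superheroes row is kept; yr_founded is filled in where the
# join key 'publisher' matches, and is NA otherwise (Hellboy here)
left_join(superheroes, publishers, by = "publisher")
```

This is exactly the move in Step 3 of the recipe above: the shapefile plays the role of superheroes, the SVI data plays the role of publishers, and GEOID is the join key.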
--- ## Creating SVI Map for MO .tiny[ Step 1: Retrieve the shapefile needed. ``` r library(tigris) mo_shape_file <- tracts(state = "MO", year = 2019) # not required but nice to visualize data head(mo_shape_file) ``` ``` ## Simple feature collection with 6 features and 12 fields ## Geometry type: POLYGON ## Dimension: XY ## Bounding box: xmin: -93.32762 ymin: 37.06259 xmax: -91.09394 ymax: 38.37489 ## Geodetic CRS: NAD83 ## STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC ## 1 29 055 450302 29055450302 4503.02 Census Tract 4503.02 G5020 ## 2 29 055 450102 29055450102 4501.02 Census Tract 4501.02 G5020 ## 3 29 055 450200 29055450200 4502 Census Tract 4502 G5020 ## 4 29 055 450400 29055450400 4504 Census Tract 4504 G5020 ## 5 29 015 460400 29015460400 4604 Census Tract 4604 G5020 ## 6 29 229 490300 29229490300 4903 Census Tract 4903 G5020 ## FUNCSTAT ALAND AWATER INTPTLAT INTPTLON ## 1 S 59019556 54839 +38.0699995 -091.3834407 ## 2 S 215515312 158937 +38.1505661 -091.1929142 ## 3 S 785265618 714683 +37.9120761 -091.2086380 ## 4 S 518540939 475755 +37.8958096 -091.3892205 ## 5 S 216350354 11553444 +38.3016635 -093.1718555 ## 6 S 335405942 257629 +37.1176895 -092.5083341 ## geometry ## 1 POLYGON ((-91.42897 38.0501... ## 2 POLYGON ((-91.31192 38.1507... ## 3 POLYGON ((-91.3684 38.09352... ## 4 POLYGON ((-91.52872 37.7942... ## 5 POLYGON ((-93.32762 38.2696... ## 6 POLYGON ((-92.68554 37.0748... ``` ] --- ## Creating SVI Map for MO .tiny[ Step 2: Load the MO SVI data into R (finalize the data we want to map). ``` r library(tidyverse) mo_svi_data <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/MissouriSVI2019.csv") |> mutate(GEOID = as.character(FIPS)) # create a character GEOID column from FIPS (the join key) ``` Step 3: Merge the SVI data into the shapefile, then create a new shapefile. ``` r mo_shape_file_v2 <- left_join(mo_shape_file, mo_svi_data) ``` Step 4: Plot the map (Note: RPL_THEMES is the SVI variable).
``` r ggplot(data = mo_shape_file_v2, mapping = aes(fill = RPL_THEMES)) + geom_sf() ``` ] --- ## Generating Map .pull-left[ .tiny[ ``` r ggplot(data = mo_shape_file_v2, aes(fill = RPL_THEMES)) + geom_sf() ``` ] ] .pull-right[ <img src="data:image/png;base64,#images/svi_mo_map.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Further Customizations .pull-left[ .tiny[ ``` r # Added theme_void() # to remove grid and grey background ggplot(data = mo_shape_file_v2, aes(fill = RPL_THEMES)) + geom_sf() + theme_void() ``` ] ] .pull-right[ <img src="data:image/png;base64,#images/svi_mo_map_v2.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Further Customizations Part 2 .pull-left[ .tiny[ ``` r # Further customizes labels and color gradient ggplot(data = mo_shape_file_v2, aes(fill = RPL_THEMES)) + geom_sf() + theme_void() + scale_fill_gradient(low="#1fa187", high="#440154") + labs(fill='MO-Specific SVI') ``` ] ] .pull-right[ <img src="data:image/png;base64,#images/svi_mo_map_v3.png" width="100%" style="display: block; margin: auto;" /> ] Note: "#1fa187" and "#440154" above are what are called hexadecimal (hex) color codes. An excellent detailed guide on colors in R can be found by clicking [this resource from UCSB](https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf). --- ## tigris package shapefiles In the previous example, we used tracts(state = "MO", year = 2019) to get the tract-specific shapefile for MO. Many other shapefiles are available. Two key examples: for a state-level map, states(); for a county-level map, counties(). To the best of my knowledge, there are 40-50 shapefiles available (e.g., AIANNH [American Indian, Alaska Native and Native Hawaiian] boundaries, zip code tabulation area (ZCTA) boundaries). Speaking of ZCTAs, a brief comment on them. --- ## Detour: Zip Code Tabulation Area (ZCTA) ACS datasets are also available at the ZCTA level. This might sound like the zip codes we use in our addresses.
But they are not the same. Most of the time, your postal zip code is the same as the ZCTA. However, zip codes primarily used by the Postal Service for P.O. boxes will likely belong to a different ZCTA. The same is true for areas with few residential addresses (or areas that are primarily occupied by commercial businesses). There are crosswalks available that can help you convert between postal zip codes and ZCTAs (e.g., [crosswalk from censusreporter](https://github.com/censusreporter/acs-aggregate/blob/master/crosswalks/zip_to_zcta/ZIP_ZCTA_README.md)). __Please talk to a geographer or demographer before doing any comprehensive work with zip codes or ZCTAs.__ --- ## tidycensus package and census data. tidycensus is an R package that allows users to interface with a select number of the US Census Bureau’s data APIs and return data frames. If your goal is to visualize Census data via a map, tidycensus is the most convenient package to use: one of its advantages is the option to return not just the requested variable(s) but also the corresponding shapefile needed. Before going further, load the tidycensus package: ``` r # run install.packages("tidycensus") if not installed library(tidycensus) ``` --- ## tidycensus package and census data. The script below uses `load_variables()` to list the available variables within the 2022 ACS5 data--this is a table shell, but loaded into R as a data frame. Remember, when someone refers to '2022 ACS 5-year data', the estimates actually use data collected in 2018-2022. ``` r variable_list_2022 <- load_variables(2022, "acs5", cache = TRUE) nrow(variable_list_2022) ``` ``` ## [1] 28152 ``` --- ## tidycensus package and census data Looks familiar?
.tiny[ ``` r head(variable_list_2022) ``` ``` ## # A tibble: 6 × 4 ## name label concept geography ## <chr> <chr> <chr> <chr> ## 1 B01001A_001 Estimate!!Total: Sex by Age (Whi… tract ## 2 B01001A_002 Estimate!!Total:!!Male: Sex by Age (Whi… tract ## 3 B01001A_003 Estimate!!Total:!!Male:!!Under 5 years Sex by Age (Whi… tract ## 4 B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years Sex by Age (Whi… tract ## 5 B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years Sex by Age (Whi… tract ## 6 B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years Sex by Age (Whi… tract ``` ] --- ## tidycensus package and census data Advanced recipe: using basic text mining skills in R to find tables related to Medicare. .tiny[ ``` r variable_list_2022 |> filter(str_detect(concept, regex("medicare", ignore_case = TRUE))) |> relocate(concept) # relocate() moves concept into the first column ``` ``` ## # A tibble: 24 × 4 ## concept name label geography ## <chr> <chr> <chr> <chr> ## 1 Allocation of Medicare Coverage B992706_001 Estimate!!Total: tract ## 2 Allocation of Medicare Coverage B992706_002 Estimate!!Total:!!Allo… tract ## 3 Allocation of Medicare Coverage B992706_003 Estimate!!Total:!!Not … tract ## 4 Medicare Coverage by Sex by Age C27006_001 Estimate!!Total: tract ## 5 Medicare Coverage by Sex by Age C27006_002 Estimate!!Total:!!Male: tract ## 6 Medicare Coverage by Sex by Age C27006_003 Estimate!!Total:!!Male… tract ## 7 Medicare Coverage by Sex by Age C27006_004 Estimate!!Total:!!Male… tract ## 8 Medicare Coverage by Sex by Age C27006_005 Estimate!!Total:!!Male… tract ## 9 Medicare Coverage by Sex by Age C27006_006 Estimate!!Total:!!Male… tract ## 10 Medicare Coverage by Sex by Age C27006_007 Estimate!!Total:!!Male… tract ## # ℹ 14 more rows ``` ] --- ## tidycensus package Task: Suppose we want to map the median % of household income spent on rent for each state using variable B25071_001.
.pull-left[ __What to run for state-level ACS5 data:__ .tiny[ ``` r library(tidycensus) census_api_key("YOUR CENSUS API KEY HERE") shapefile_with_data <- get_acs( geography = "state", variables = "B25071_001", year = 2019, survey = "acs5", geometry = TRUE, shift_geo = TRUE ) ``` ] ] .pull-right[ The key part here is: make sure geometry = TRUE, as the default is FALSE. By setting geometry to TRUE, you are instructing get_acs() to return the final data as an sf object (shapefile) that is ready for map rendering via ggplot2. shift_geo = TRUE is also important, as it rescales and shifts Alaska, Hawaii, and Puerto Rico so that they sit near the contiguous United States. NOTE: when we looked at the table shells, we added 'E' at the end of the variable name when using the censusapi package. For tidycensus, it is not required. ] --- ## Rendering the map .pull-left[ .tiny[ ``` r ggplot(data = shapefile_with_data, aes(fill = estimate)) + geom_sf() + theme_void() + labs(fill='Median Gross Rent as a % of Household Income') + scale_fill_gradient(low="#1fa187", high="#440154") + theme(legend.position="bottom") ``` ] ] .pull-right[ <img src="data:image/png;base64,#images/rent_map.png" width="100%" style="display: block; margin: auto;" /> ] Note: "#1fa187" and "#440154" above are what are called hexadecimal (hex) color codes. An excellent detailed guide on colors in R can be found by clicking [this resource from UCSB](https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf). --- ## Importance of shifting geometry I mentioned earlier that shift_geo = TRUE is important. Here's the map you'd generate without setting that argument as TRUE.
<img src="data:image/png;base64,#images/without_shift.png" width="100%" style="display: block; margin: auto;" /> --- ## Rendering the map: Example 2 (Full Template) .tiny[ ``` r library(tidycensus) library(tidyverse) census_api_key("YOUR CENSUS API KEY HERE") ct_shapefile_with_data <- get_acs( geography = "county", state = "CT", variables = "B25071_001", year = 2019, survey = "acs5", geometry = TRUE # shift_geo is not needed if you're not mapping the entire US ) ggplot(data = ct_shapefile_with_data, aes(fill = estimate)) + geom_sf() + theme_void() + labs(fill='Median Gross Rent as a % of Household Income') + scale_fill_gradient(low="white", high="black") ``` ] See the next slide for the rendered map. --- ## Rendering the map: Example 2 <img src="data:image/png;base64,#images/ct_map.png" width="100%" style="display: block; margin: auto;" /> --- ## Impending syntax change for tidycensus: nationwide map .tiny[ __Current syntax:__ ``` r shapefile_with_data <- get_acs( geography = "state", variables = "B25071_001E", year = 2019, survey = "acs5", geometry = TRUE, shift_geo = TRUE ) ``` __Future release syntax:__ ``` r shapefile_with_data <- get_acs( geography = "state", variables = "B25071_001E", year = 2019, survey = "acs5", geometry = TRUE ) |> shift_geometry() ``` ] --- ## Other functions from tidycensus `get_estimates()` can give you detailed information about population characteristics. In your own time, try changing the value of `product` to the following: "components", "population", or "characteristics".
.tiny[ ``` r get_estimates(geography = "state", product = "components", vintage = 2023) ``` ``` ## Using the Vintage 2023 Population Estimates ``` ``` ## # A tibble: 676 × 5 ## GEOID NAME variable year value ## <chr> <chr> <chr> <int> <dbl> ## 1 01 Alabama BIRTHS 2023 58251 ## 2 01 Alabama DEATHS 2023 59813 ## 3 01 Alabama NATURALCHG 2023 -1562 ## 4 01 Alabama INTERNATIONALMIG 2023 5384 ## 5 01 Alabama DOMESTICMIG 2023 30744 ## 6 01 Alabama NETMIG 2023 36128 ## 7 01 Alabama RESIDUAL 2023 -1 ## 8 01 Alabama RBIRTH 2023 11.4 ## 9 01 Alabama RDEATH 2023 11.7 ## 10 01 Alabama RNATURALCHG 2023 -0.307 ## # ℹ 666 more rows ``` ] --- ## Other functions from tidycensus `get_flows()` provides detailed migration flow data (if available). .tiny[ ``` r get_flows( geography = "county", state = "NY", county = "New York", year = 2019 ) ``` ``` ## # A tibble: 2,019 × 7 ## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 36061 <NA> New York County, New York Africa MOVEDIN 468 182 ## 2 36061 <NA> New York County, New York Africa MOVEDOUT NA NA ## 3 36061 <NA> New York County, New York Africa MOVEDNET NA NA ## 4 36061 <NA> New York County, New York Asia MOVEDIN 9911 1039 ## 5 36061 <NA> New York County, New York Asia MOVEDOUT NA NA ## 6 36061 <NA> New York County, New York Asia MOVEDNET NA NA ## 7 36061 <NA> New York County, New York Central Amer… MOVEDIN 1553 857 ## 8 36061 <NA> New York County, New York Central Amer… MOVEDOUT NA NA ## 9 36061 <NA> New York County, New York Central Amer… MOVEDNET NA NA ## 10 36061 <NA> New York County, New York Caribbean MOVEDIN 2783 712 ## # ℹ 2,009 more rows ``` ] --- class: center, middle ## Post-class exercise ideas: Use tidycensus to create other maps. Pair the tigris package with outside data and create a map of your own choosing! If you want comprehensive practice with a lot of problems to work on, work on the `PostClassPractice.qmd` file.
It is probably a solid 1-2 hours of work that will eventually ask you to create multiple maps! If you work on the problems and get stuck, email me at `jfeliciano@american.edu` and I'd be happy to assist you!